Homework 3

Instructions
  • Please submit your responses as a nicely formatted notebook (.ipynb) file.
  • Show your work for calculation questions (i.e., show the R code you used to answer the question).
  • Make sure your notebook runs without errors and shows all required outputs.
  • If your answer includes many decimal places (e.g., 3.14159265358979), please round it to a reasonable number of decimals for readability (typically 3 to 4).
  • Avoid showing unnecessary outputs.
  • Plots must be made using ggplot2.
  • Only the following packages are permitted:
    • tidyverse

Question 1

Calculate \(\mu\), \(\text{ss}\), \(\sigma^2\), and \(\sigma\) for the sons’ height data (in inches) used in the lecture. You are NOT allowed to use the functions mean(), sd(), or var().

library(tidyverse)
heights <- read_csv("data/PearsonHeightData.csv")

n <- nrow(heights)

# Mean
m <- sum(heights$Son) / n
m
[1] 68.68423
# Sum of Squared Deviations (SS)
ss <- sum((heights$Son - m)^2)
round(ss, 5)
[1] 8541.632
# Variance
v <- ss / n
round(v, 5)
[1] 7.92359
# Standard Deviation
s <- sqrt(v)
round(s, 5)
[1] 2.81489


Question 2

Calculate \(\bar{x}\), \(\text{ss}\), \(s^2\), and \(s\) of the following data. You are not allowed to use the functions mean(), sd(), or var().

676, 692, 661, 692, 637, 676, 649, 656, 643, 665, 694, 643, 677
x <- c(676, 692, 661, 692, 637, 676, 649, 656, 643, 665, 694, 643, 677)
n <- length(x)

# Mean
m <- sum(x)/n
round(m, 5)
[1] 666.2308
# Sum of Squared Deviations
ss <- sum((x - m)^2)
round(ss, 5)
[1] 4770.308
# Variance
v <- ss / (n - 1)
round(v, 5)
[1] 397.5256
# Standard Deviation
s <- sqrt(v)
round(s, 5)
[1] 19.93805


Question 3

Assuming the data from the previous question is normally distributed, calculate the points within which the middle 95% of the data fall. Do not use the functions pnorm() or qnorm(); instead, use the values in the graph below (from the lecture) to determine the boundaries.

[Figure: Normal Ranges]

# Lower Bound
lower <- m - 1.96 * s

# Upper Bound
upper <- m + 1.96 * s

If you would like to display the output as an interval like I have done, you can use the paste0() function:

paste0("[", round(lower, 5), ", ", round(upper, 5), "]") 
[1] "[627.1522, 705.30934]"


Question 4

Assuming the data from question 2 is normally distributed, calculate the points the middle 68% of the data fall within. Do not use the functions pnorm() or qnorm(); instead, use the graph that was provided.

# Lower Bound
lower <- m - 1 * s

# Upper Bound
upper <- m + 1 * s

paste0("[", round(lower, 5), ", ", round(upper, 5), "]") 
[1] "[646.29272, 686.16881]"


Question 5

Use R code to calculate \(\mu\), \(\text{ss}\), \(\sigma^2\), and \(\sigma\) for the following numbers:

10, 10, 10, 10, 10, 10, 10, 10, 10, 10

Why do you think the answer turned out the way it did?

x <- c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10)
n <- length(x)

# Mean
m <- sum(x)/n
m
[1] 10
# Sum of Squared Deviations
ss <- sum((x - m)^2)
ss
[1] 0
# Variance
v <- ss/(n)
v
[1] 0
# Standard Deviation
s <- sqrt(v)
s
[1] 0

The data has no spread (i.e., it does not vary/deviate); hence our statistical measures of spread (the sum of squared deviations, variance, and standard deviation) are all 0.

Important

If you used sd() or var() in your R code, you did this wrong. Those functions calculate the sample standard deviation and sample variance, respectively.
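To see the difference concretely, here is a quick illustration (using made-up numbers, not the homework data) of how the sample variance from var() relates to the population variance:

```r
# Made-up data for illustration only
x <- c(4, 8, 6, 5, 3)
n <- length(x)

# Population variance: divide the sum of squared deviations by n
m <- sum(x) / n
pop_var <- sum((x - m)^2) / n

# var() divides by n - 1; rescaling by (n - 1)/n recovers the population value
all.equal(pop_var, var(x) * (n - 1) / n)
[1] TRUE
```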



Question 6

Which of the following are measures of spread?

  • Range
  • IQR
  • Quartiles
  • Median
  • Mean
  • Normalized Dispersion Quotient
  • Sum of Squared Deviations
  • The “Eh, Close Enough” Range
  • Variance
  • Standard Deviation
  • Z-score
  • Standardized Wobble

Range, IQR, Sum of Squared Deviations, Variance, Standard Deviation.



Question 7

Using the Pearson height data from the lecture, transform the sons’ scores to have a mean of 500 and a standard deviation of 500. Then plot a histogram of the new distribution with an interesting fill and edge colour.

# Z-scores
heights$z_scores <- (heights$Son - mean(heights$Son)) / sd(heights$Son)

# New Scores
heights$son_t <- heights$z_scores * 500 + 500

# Plot
ggplot(heights, aes(x = son_t)) + 
  geom_histogram(fill = "#98EDE8", colour = "#249393") +
  xlab("New Son Heights") +
  ylab("Count")
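As a quick sanity check of the rescaling logic (with made-up numbers, not the height data): after z-scoring and rescaling, the mean and standard deviation land exactly on the targets.

```r
# Made-up heights for illustration
x <- c(61, 64, 67, 70, 73)

# Standardize, then rescale to mean 500 and sd 500
z <- (x - mean(x)) / sd(x)
new <- z * 500 + 500

round(mean(new), 5)  # 500
round(sd(new), 5)    # 500
```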



Question 8

From the course website, download the dataset Pokemon.csv. This data set presents statistics for each of the pokémon characters in the popular video game franchise “Pokémon.”

Use this data set to plot a density plot for the variable “HP.” Give the density plot a fill colour of “Pikachu yellow”, which is defined by Nintendo as “#F6CF57.”

  • HP stands for hit points, or health points, and defines how much damage a pokémon can withstand before fainting.

Some of the columns in this dataset contain Japanese characters. If you open this file in Microsoft Excel, there is a good chance they won’t display properly. This is because, by default, Microsoft Excel does not read CSV files using UTF-8 character encoding, despite that being the accepted standard across the world. Why? I don’t know, but it is probably because they want you to use .xlsx files and simply don’t care about anything that isn’t controlled by Microsoft. Note, if you want spreadsheet software that uses UTF-8 character encoding by default, try LibreOffice or even Google Sheets.

poke <- read_csv("data/Pokemon.csv")
pikachu <- '#F6CF57'

ggplot(poke, aes(x = hp)) +
  geom_density(fill = pikachu, colour = "black") +
  xlab("Hit Points (HP)") +
  ylab("Density")



Question 9

Plot a Q-Q plot of the pokémon HP column and indicate whether or not you believe the data is normally distributed (describe why you believe this, appealing to both the density and Q-Q plots).

Make sure the Q-Q plot incorporates Pikachu yellow in some way.

ggplot(poke, aes(sample = hp)) +
  stat_qq(
    size = 3,
    shape = 21,
    fill = pikachu
  ) +
  stat_qq_line() +
  labs(
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

The data is not acceptably normal. The density plot shows a long tail in the positive direction, which indicates the presence of values likely to pull the mean upward and inflate the standard deviation. The Q-Q plot shows large portions of the tails of the distribution deviating from the diagonal reference line that normally distributed data would follow.

Note for grading
  • It’s also fine to label the x and y axes as “Expected Quantiles (or Values)” and “Observed Quantiles (or Values)” respectively. The y-axis could also be labelled as its original units (e.g., “HP Values”).


Question 10

Use Tukey’s Ladder of Powers to find a transformation that makes the pokémon HP data more normal/Gaussian. Specify which value of lambda (\(\lambda\)) you used and plot a histogram of the transformed data. Ensure that your x-axis label communicates the transformation you applied.

The function transformTukey() is NOT allowed to be used for this question. You need to experiment and walk up/down the ladder yourself. (You do not need to show all your attempts, just show the final result). An exhaustive search for the exact value of lambda is unnecessary. Use only the values provided in the lecture table.
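One illustrative way to "walk the ladder" by hand, shown here on a small made-up right-skewed vector. The skewness proxy used below, (mean − median) / sd, is just an informal aid for comparing rungs (it is not from the lecture); visually comparing histograms works just as well.

```r
# Made-up right-skewed data
x <- c(1, 2, 2, 3, 3, 4, 5, 7, 10, 25)

# Informal skewness proxy: positive values suggest right skew
skew <- function(v) (mean(v) - median(v)) / sd(v)

skew(x)          # raw data (lambda = 1)
skew(sqrt(x))    # lambda = 1/2
skew(x^(1/3))    # lambda = 1/3
skew(log(x))     # lambda = 0
```

Each step down the ladder pulls in the long right tail a little further.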

poke$hp_t <- poke$hp^(1/2)

ggplot(poke, aes(x = hp_t)) +
  geom_histogram(fill = pikachu, colour = "black") +
  labs(
    x = expression(HP^{(1/2)}),
    y = "Count"
  )

The square root was used (i.e., \(\lambda = \frac{1}{2}\)).

Note

The cube root is also an acceptable transformation (i.e., \(\lambda = \frac{1}{3}\)).

Note

Use of expression() is not necessary for full marks. Something like x = "HP^(1/2)" is acceptable as well.



Question 11

Is the pokémon HP data now acceptably normal?

# To evaluate this properly, you need to check a Q-Q plot:

ggplot(poke, aes(sample = hp_t)) +
  stat_qq() +
  stat_qq_line() +
  labs(
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

While there is some improvement, the data is still not acceptably normal since, in both the histogram and Q-Q plot, you can see very extreme outlying values which are probably inflating the standard deviation and shifting the mean.

  • Evaluation of normality needs to include, at a minimum, a Q-Q plot.
  • Statistical tests for normality should not be used (marks will be deducted).
  • It’s also fine to label the x and y axes as “Expected Quantiles (or Values)” and “Observed Quantiles (or Values)” respectively. The y-axis could also be labelled in its transformed units (e.g., y = "HP^(1/2)").


Question 12

For the variable HP, calculate the following statistics using the original (untransformed) pokémon data:

  • Mean
  • Sample Standard Deviation

You are not allowed to use the mean(), var(), or sd() functions.

# Sample Size
N <- length(poke$hp)

# Mean
m <- sum(poke$hp) / N
m
[1] 68.9588
# Standard Deviation
s <- sqrt(sum((poke$hp - m)^2) / (N - 1))
s
[1] 26.57601


Question 13

For the variable HP, calculate the following statistics using the original (untransformed) pokémon data:

  • 20% Trimmed Mean
  • 20% Winsorized Standard Deviation

You are not allowed to use the mean() function or any function from the WRS2 package (though it’s probably a good idea to use them to check your result).

You may find the following functions helpful:
  • sum()
  • length() / nrow()
  • rep()
  • sort()
  • floor()
  • Indexing with []
# Sample Size
N <- length(poke$hp)

# Sort data
hp <- sort(poke$hp)

# Amount of values to remove from each side
trim <- floor(N * 0.20)

# Subset rows to keep
hp_trim <- hp[(trim + 1) : (N - trim)]

# 20% Trimmed mean
sum(hp_trim) / length(hp_trim)
[1] 66.09563
# Add winsorized values to trimmed data
hp_win <- c(
  hp_trim,
  rep(min(hp_trim), trim), # repeating the min value
  rep(max(hp_trim), trim) # repeating the max value
)

# 20% Winsorized Standard Deviation
sd(hp_win)
[1] 14.44225
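As the question suggests, it's worth checking the trimming logic against a built-in. On a small made-up vector, base R's mean(x, trim = 0.20) should agree with the manual approach, since both drop floor(N * 0.20) values from each end of the sorted data:

```r
# Made-up vector with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
N <- length(x)
trim <- floor(N * 0.20)  # 2 values dropped from each side

x_sorted <- sort(x)
x_trim <- x_sorted[(trim + 1):(N - trim)]

sum(x_trim) / length(x_trim)  # manual trimmed mean: 5.5
mean(x, trim = 0.20)          # built-in check: 5.5
```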


Question 14

Use the R functions pnorm() and qnorm() to answer questions a - p.

Note

There are multiple correct ways to apply these functions for some of the questions.
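For example, the following pairs give identical results (standard normal defaults shown):

```r
# Two equivalent ways to get the same cut-off
qnorm(0.975)                      # 97.5th percentile
qnorm(0.025, lower.tail = FALSE)  # top 2.5% -- same value

# Likewise for tail probabilities
1 - pnorm(1.96)                   # area above z = 1.96
pnorm(1.96, lower.tail = FALSE)   # same, without the subtraction
```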

a.

What percentile is a z-score of 0 at?

pnorm(0, mean = 0, sd = 1) * 100
[1] 50

b.

What percentile is a z-score of -1.96 at?

pnorm(-1.96, mean = 0, sd = 1) * 100
[1] 2.49979

c. 

What percentile is a z-score of 1.96 at?

pnorm(1.96, mean = 0, sd = 1) * 100
[1] 97.50021

d. 

What probability does a z-score of 1.96 correspond to?

pnorm(1.96, mean = 0, sd = 1)
[1] 0.9750021

e.

What percentile is a score of 202 at for a normal distribution with a mean of 224 and standard deviation of 29?

pnorm(202, mean = 224, sd = 29) * 100
[1] 22.40397

f. 

What percentile is a score of 224 at for a normal distribution with a mean of 224 and standard deviation of 29?

pnorm(224, mean = 224, sd = 29) * 100
[1] 50

g.

What score is at the 2.5th percentile for a normal distribution with a mean of 224 and standard deviation of 29?

qnorm(0.025, mean = 224, sd = 29)
[1] 167.161

h.

What score is at the 97.5th percentile for a normal distribution with a mean of 224 and standard deviation of 29?

qnorm(0.975, mean = 224, sd = 29)
[1] 280.839

i.

What z-score represents the 2.5th percentile?

qnorm(0.025, mean = 0, sd = 1)
[1] -1.959964

j.

What z-score represents the top 2.5% of a normal distribution?

qnorm(0.025, mean = 0, sd = 1, lower.tail = FALSE)
[1] 1.959964

k.

What z-score is at the 97.5th percentile?

qnorm(0.975, mean = 0, sd = 1)
[1] 1.959964

l.

What score represents the top 2.5% of a normal distribution with a mean of 224 and standard deviation of 29?

qnorm(0.025, mean = 224, sd = 29, lower.tail = FALSE)
[1] 280.839

m.

What score represents the top 5% of a normal distribution with a mean of 224 and standard deviation of 29?

qnorm(0.05, mean = 224, sd = 29, lower.tail = FALSE)
[1] 271.7008

n. 

What z-score represents the top 5% of a normal distribution?

qnorm(0.05, mean = 0, sd = 1, lower.tail = FALSE)
[1] 1.644854

o.

What score represents the top 5% of a normal distribution with a mean of 132 and standard deviation of 57?

qnorm(0.05, mean = 132, sd = 57, lower.tail = FALSE)
[1] 225.7567

p. 

What score represents the bottom 9.3% of a normal distribution with a mean of 132 and standard deviation of 57?

qnorm(0.093, mean = 132, sd = 57)
[1] 56.61721


Question 15

Which distribution contains the largest spread/variability?

Histogram C

Note

To discern the answer, you need to gauge the scale of the x-axis.


Question 16

The following plot is a standard normal distribution. What percentile is represented by the coloured (non-grey) region?

pnorm(-0.5, mean = 0, sd = 1) * 100
[1] 30.85375


Question 17

The following plot is a standard normal distribution. What percentile is represented by the coloured (non-grey) region?

pnorm(-1, mean = 0, sd = 1) * 100
[1] 15.86553


Question 18

The following distribution has a mean of 100 and standard deviation of 15. What percentile is represented by the coloured (non-grey) region?

pnorm(70, mean = 100, sd = 15) * 100
[1] 2.275013


Question 19

The following distribution has a mean of 100 and standard deviation of 15. What percentage of the data is represented by the coloured (non-grey) region?

pnorm(120, mean = 100, sd = 15, lower.tail = FALSE) * 100
[1] 9.121122


Question 20

Bobby’s brother constantly tells him he’s abnormal. To prove this, his brother obtained a sample of test scores from Bobby’s class (Bobby was included in this sample). His classmates’ scores were 70, 68, 75, 72, 76, 74, 69, 82, 73, and Bobby scored a 56. If “abnormal” is operationally defined as being at least 2 standard deviations away from the mean, is Bobby abnormal? Assume the data follows a normal distribution.

bobby <- 56
class <- c(70, 68, 75, 72, 76, 74, 69, 82, 73, bobby)
n <- length(class)
m <- mean(class)
s <- sd(class)

bobby_z <- (bobby - m) / s
bobby_z
[1] -2.289502

A z-score shows Bobby to be 2.29 standard deviations below the mean. Therefore, Bobby’s brother is (technically) correct. Bobby is abnormal.

Note

This could have also been solved by calculating an interval spanning \(\pm\) 2 standard deviations.
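A sketch of that alternative (same class data, restated so the chunk runs on its own; mean() and sd() are fine here since this question doesn't restrict them):

```r
bobby <- 56
class_scores <- c(70, 68, 75, 72, 76, 74, 69, 82, 73, bobby)
m <- mean(class_scores)
s <- sd(class_scores)

# "Normal" range: mean plus/minus 2 standard deviations
lower <- m - 2 * s
upper <- m + 2 * s

bobby < lower | bobby > upper  # TRUE: Bobby falls outside the interval
```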


Question 21

From the previous question, what is the probability of a student scoring at Bobby’s mark or less?

bob_prob <- pnorm(bobby, mean = m, sd = s)
bob_prob
[1] 0.01102511


Question 22

What is the probability of a student scoring higher than Bobby?

1 - bob_prob
[1] 0.9889749


Question 23

The year is 1987, and we have sent everyone in a large introductory course out to check whether people use seat belts. Each student has been told to look at 100 cars and count the number of people wearing seat belts. The number found by any given student is considered that student’s score. The mean score for the class is 44, with a standard deviation of 7. A student who has done very little work all year has reported finding 62 seat belt users out of 100. Do we have reason to suspect that the student just made up a number rather than actually counting? Assume the data collected by the class follows a normal distribution.

p <- pnorm(62, mean = 44, sd = 7)
p
[1] 0.994936

The student’s score of 62 is at, approximately, the 99th percentile, which is exceedingly rare given all the other data collected. It seems very likely the student fabricated their answer.

Other possible ways of solving this:

You could compute a z-score. A z-score that is greater than 1.96 (or 2) standard deviations from the mean would (according to common convention) indicate a statistically rare value and thus suggest the student made up the number.

z <- (62 - 44) / 7
z
[1] 2.571429

You could compute an interval encompassing the middle 95% of the data. If the student’s score of 62 falls outside of that range, it’s probably fabricated. i.e.,

lower <- 44 - 1.96 * 7
upper <- 44 + 1.96 * 7

paste0("[", round(lower, 5), ", ", round(upper, 5), "]") 
[1] "[30.28, 57.72]"


Question 24

From the previous question, what is the minimum and maximum number of seatbelts that could have been “observed” by this student without drawing suspicion? Assume anything in the top and bottom 2.5% of the distribution is suspicious. Round to the nearest seatbelt.

# Min (bottom 2.5%)
round(qnorm(p = 0.025, mean = 44, sd = 7))
[1] 30
# Max (top 2.5%)
round(qnorm(p = 0.975, mean = 44, sd = 7))
[1] 58


Question 25

A researcher counted the number of behavioural responses a sample of participants completed during a period of operant extinction (i.e., the participants were no longer reinforced/rewarded to perform a task). The data was positively skewed but the researcher nonetheless analyzed the results in terms of means and standard deviations. They reported a mean of \(5.88\) and a standard deviation of \(8.06\). Why was this an incredibly poor way to analyze the results? Hint: consider where \(0\) responses might fall if this distribution was symmetric and unimodal.

resp <- pnorm(q = 0, mean = 5.88, sd = 8.06)
resp
[1] 0.2328392

Using means and standard deviations assumes the data is symmetric. Given the mean and standard deviation they report, approximately 23% of their data must fall below 0. This implies that they observed participants completing a negative number of responses! That doesn’t make any sense.


Question 26

Recall that in Questions 3 and 4 you calculated the middle 95% and 68% boundaries for the following data:

676, 692, 661, 692, 637, 676, 649, 656, 643, 665, 694, 643, 677


Do that again, but this time use R functions to obtain a more precise calculation instead of using the values in the graph that was provided.

x <- c(676, 692, 661, 692, 637, 676, 649, 656, 643, 665, 694, 643, 677)

n <- length(x)
m <- mean(x)
s <- sd(x)

# Middle 95% (2.5% in the tails)
lower_95 <- qnorm(0.025, mean = m, sd = s)
upper_95 <- qnorm(0.025, mean = m, sd = s, lower.tail = FALSE)
paste0("Middle 95%: [", round(lower_95, 5), ", ", round(upper_95, 5), "]") 
[1] "Middle 95%: [627.15292, 705.30862]"

# Middle 68% (16% in the tails)
lower_68 <- qnorm(0.16, mean = m, sd = s)
upper_68 <- qnorm(0.16, mean = m, sd = s, lower.tail = FALSE)
paste0("Middle 68%: [", round(lower_68, 5), ", ", round(upper_68, 5), "]") 
[1] "Middle 68%: [646.40322, 686.05832]"


Question 27

In a normal distribution, how many standard deviations above and below the mean are the upper and lower boundaries of the middle 68% interval? Give an answer rounded to at least 4 decimal places.

q <- abs(qnorm(0.16, mean = 0, sd = 1))
q
[1] 0.9944579